
Update MG negative sampling to return random samples distributed as specified #4885

Merged

Conversation

@ChuckHastings (Collaborator) commented Jan 23, 2025

Modifies the new negative sampling interface so that, when called from MG, each rank specifies how many samples it wishes to receive, and the samples are randomly distributed across the calling GPUs.

Marked breaking as it changes the C++ interface... although nothing uses it yet.

Closes #4672
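To illustrate the idea, here is a hedged host-side sketch in plain C++ (not the device code in this PR; the function name is hypothetical): given each rank's requested count, build one destination entry per sample and shuffle, so every generated sample lands on a uniformly random GPU while each rank still receives exactly the count it asked for.

// Hedged host-side sketch only; the real implementation operates on device data.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> assign_samples_to_ranks(std::vector<std::size_t> const& samples_per_gpu,
                                         std::mt19937& rng)
{
  std::size_t total =
    std::accumulate(samples_per_gpu.begin(), samples_per_gpu.end(), std::size_t{0});

  // One entry per sample; the entry value is the rank that will receive that sample.
  std::vector<int> assignment;
  assignment.reserve(total);
  for (int rank = 0; rank < static_cast<int>(samples_per_gpu.size()); ++rank) {
    assignment.insert(assignment.end(), samples_per_gpu[rank], rank);
  }

  // Shuffle so the sample-to-rank assignment is random rather than blocked,
  // while each rank's total stays exactly what it requested.
  std::shuffle(assignment.begin(), assignment.end(), rng);
  return assignment;
}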

copy-pr-bot (bot) commented Jan 23, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@ChuckHastings marked this pull request as ready for review January 23, 2025 00:52
@ChuckHastings requested a review from a team as a code owner January 23, 2025 00:52
@ChuckHastings self-assigned this Jan 23, 2025
@ChuckHastings added the improvement (Improvement / enhancement to an existing function) and breaking (Breaking change) labels and removed the cuGraph label Jan 23, 2025
@ChuckHastings added this to the 25.02 milestone Jan 23, 2025
@seunghwak (Contributor) left a comment

Shouldn't we update the documentation?

@ChuckHastings (Collaborator, Author)

Shouldn't we update the documentation?

Just pushed an update to the documentation.

@alexbarghi-nv (Member)

Can we hold this PR until I have a corresponding one in cuGraph-GNN?


if constexpr (multi_gpu) {
  samples_per_gpu = host_scalar_allgather(handle.get_comms(), num_samples, handle.get_stream());
  handle.sync_stream();
Collaborator Author

Fixed in next push.

Comment on lines +460 to +469
while (reduction > 0) {
  size_t est_reduction_per_gpu = (reduction + comm_size - 1) / comm_size;
  for (size_t i = 0; i < samples_per_gpu.size(); ++i) {
    if (samples_per_gpu[i] > est_reduction_per_gpu) {
      samples_per_gpu[i] -= est_reduction_per_gpu;
      reduction -= est_reduction_per_gpu;
    } else {
      reduction -= samples_per_gpu[i];
      samples_per_gpu[i] = 0;
    }
Contributor

I think this logic has a flaw.

Say reduction = 3 & comm_size = 2.
Then, est_reduction_per_gpu = 2.
If samples_per_gpu[i] > 2 on both GPUs, reduction = 3 - 2 - 2, which wraps around (unsigned overflow)!

Collaborator Author

This is (I hope) an uncommon case; usually, I believe, it is called with exact_number_of_samples == true. The code's probably not optimal... I was mostly aiming for something functional.

The next line should correct the edge condition you're concerned about. When i is 0, reduction = 3, comm_size = 2, so set_reduction_per_gpu = 2. We'll take the first branch and reduce reduction to 1. The next line will then trigger and reduce set_reduction_per_gpu to 1.
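For reference, a standalone sketch of the loop as described above. The clamp below stands in for the "next line" mentioned in this comment (later identified in the thread as line 470), which is not part of the quoted snippet; this is an illustration of the described behavior, not the code that landed, and it assumes reduction never exceeds the sum of samples_per_gpu.

#include <cstddef>
#include <vector>

// Sketch only: trim `reduction` samples from the per-GPU counts.
void reduce_overage(std::vector<std::size_t>& samples_per_gpu, std::size_t reduction)
{
  std::size_t comm_size = samples_per_gpu.size();
  while (reduction > 0) {
    std::size_t est_reduction_per_gpu = (reduction + comm_size - 1) / comm_size;
    for (std::size_t i = 0; i < samples_per_gpu.size(); ++i) {
      if (samples_per_gpu[i] > est_reduction_per_gpu) {
        samples_per_gpu[i] -= est_reduction_per_gpu;
        reduction -= est_reduction_per_gpu;
      } else {
        reduction -= samples_per_gpu[i];
        samples_per_gpu[i] = 0;
      }
      // Stand-in for the clamp on the line following the quoted snippet: once
      // fewer than est_reduction_per_gpu samples remain to be trimmed, shrink
      // the estimate so the subtractions above cannot wrap below zero.
      // e.g. reduction = 3, comm_size = 2: GPU 0 gives up 2, the estimate drops
      // to 1, and GPU 1 gives up only 1.
      if (reduction < est_reduction_per_gpu) { est_reduction_per_gpu = reduction; }
    }
  }
}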

Contributor

reduce set_reduction_per_gpu to 1.

You mean est_reduction_per_gpu? This is set before the for loop. Once it is set to 2 in line 460, it will be 2 till the end of the for loop in line 416. Am I missing something?

Contributor

Oh, sorry, I missed line 470.

Comment on lines 499 to 521
rmm::device_uvector<int> reduced_ranks(comm_size, handle.get_stream());
rmm::device_uvector<size_t> reduced_counts(comm_size, handle.get_stream());

reduced_ranks.resize(
  thrust::distance(reduced_ranks.begin(),
                   thrust::reduce_by_key(handle.get_thrust_policy(),
                                         gpu_assignment.begin(),
                                         gpu_assignment.end(),
                                         thrust::make_constant_iterator(size_t{1}),
                                         reduced_ranks.begin(),
                                         reduced_counts.begin(),
                                         thrust::equal_to<int>())
                     .first),
  handle.get_stream());
reduced_counts.resize(reduced_ranks.size(), handle.get_stream());

rmm::device_uvector<size_t> send_count(comm_size, handle.get_stream());
thrust::fill(handle.get_thrust_policy(), send_count.begin(), send_count.end(), 0);
thrust::scatter(handle.get_thrust_policy(),
                reduced_counts.begin(),
                reduced_counts.end(),
                reduced_ranks.begin(),
                send_count.begin());
Contributor

gpu_assignment is already sorted.

We can use thrust::upper_bound to find boundaries. This will require just comm_size binary searches and will be much faster.
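A hedged sketch of that suggestion (written against plain Thrust with thrust::device_vector so it is self-contained; the PR code itself uses rmm::device_uvector and the handle's stream and Thrust policy): comm_size binary searches give each rank's end offset in the sorted assignment, and adjacent differences turn those offsets into per-rank send counts.

#include <thrust/adjacent_difference.h>
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

// gpu_assignment must be sorted ascending by destination rank.
thrust::device_vector<size_t> compute_send_counts(
  thrust::device_vector<int> const& gpu_assignment, int comm_size)
{
  thrust::device_vector<size_t> send_count(comm_size);

  // For each rank r in [0, comm_size), store the offset one past its last assigned sample.
  thrust::upper_bound(thrust::device,
                      gpu_assignment.begin(),
                      gpu_assignment.end(),
                      thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(comm_size),
                      send_count.begin());

  // Convert cumulative end offsets into per-rank counts (in-place is allowed here).
  thrust::adjacent_difference(
    thrust::device, send_count.begin(), send_count.end(), send_count.begin());

  return send_count;
}

The reduce_by_key/scatter version above computes the same counts, but it does a full pass over the assignment array; the binary-search version touches only comm_size positions.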

Collaborator Author

Fixed in the next push.

@seunghwak (Contributor) left a comment

LGTM

@ChuckHastings (Collaborator, Author)

/merge

@alexbarghi-nv (Member)

@ChuckHastings doesn't this affect negative sampling in the C API? So once this merges, I'll have to update cugraph-pyg to call this function with the number of samples on each GPU.

@ChuckHastings (Collaborator, Author)

@ChuckHastings doesn't this affect negative sampling in the C API? So once this merges, I'll have to update cugraph-pyg to call this function with the number of samples on each GPU.

This changes the semantic meaning of the parameter. Prior to this change, if you wanted this to run against an MG graph, you would specify the total number of samples across all GPUs as the num_samples parameter on each GPU (the same value on each GPU). With this change, each GPU should specify the number of samples that it wants.

So, depending on what you implemented in cugraph-pyg, you may need to change the value that's passed for num_samples, yes.
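For callers migrating, a value that used to be passed as the global total on every rank can be split into per-rank counts before the call. A hedged, hypothetical helper (not part of the cuGraph API) sketching that split:

#include <cstddef>
#include <vector>

// Hypothetical helper: split the old "global total" into per-rank counts so that
// each rank can pass only the number of samples it should receive.
std::vector<std::size_t> split_global_sample_count(std::size_t global_total, int comm_size)
{
  std::vector<std::size_t> per_rank(comm_size, global_total / comm_size);
  // Hand the remainder out one sample at a time to the lowest ranks.
  for (std::size_t r = 0; r < global_total % comm_size; ++r) {
    ++per_rank[r];
  }
  return per_rank;
}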

@rapids-bot rapids-bot bot merged commit 5a41b41 into rapidsai:branch-25.02 Jan 30, 2025
79 checks passed
alexbarghi-nv added a commit to alexbarghi-nv/cugraph-gnn that referenced this pull request Jan 30, 2025
rapids-bot bot pushed a commit to rapidsai/cugraph-gnn that referenced this pull request Feb 3, 2025
Adds a heterogeneous link prediction example for cuGraph-PyG that uses the Taobao dataset.  Loosely based on the Taobao example from the PyG repository.

Adds ability to specify fanout as a dictionary to better align with PyG API.

Fixes a bug where the number of negative samples was calculated incorrectly, causing additional unwanted negative samples to be generated.

Updates the negative sampling call to match the new behavior added in rapidsai/cugraph#4885

Merge after rapidsai/cugraph#4898

Authors:
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Tingyu Wang (https://github.com/tingyu66)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #104
Labels
breaking (Breaking change), cuGraph, improvement (Improvement / enhancement to an existing function)

Projects
None yet

Development
Successfully merging this pull request may close these issues:
Update Negative Sampling to request edges per worker

3 participants